AITopics | benchmark study

benchmark study

"unified " description and implementation of E(2)-equivariant CNNs in an "umbrella framework " and their "extremely

Neural Information Processing Systems, Feb-12-2026, 01:21:04 GMT

We will further try to simplify the paper in general.

artificial intelligence, description and implementation, general solution, (17 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.31)

"unified " description and implementation of E(2)-equivariant CNNs in an "umbrella framework " and their "extremely

Neural Information Processing Systems, Oct-2-2025, 15:41:24 GMT

We will further try to simplify the paper in general.

convolution, description and implementation, general solution, (16 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence (0.31)

No Free Lunch from Audio Pretraining in Bioacoustics: A Benchmark Study of Embeddings

Chen, Chenggang, Yang, Zhiyu

arXiv.org Artificial Intelligence, Aug-15-2025

Bioacoustics, the study of animal sounds, offers a non-invasive method to monitor ecosystems. Extracting embeddings from audio-pretrained deep learning (DL) models without fine-tuning has become popular for obtaining bioacoustic features for downstream tasks. However, a recent benchmark study reveals that while fine-tuned audio-pretrained VGG and transformer models achieve state-of-the-art performance in some tasks, they fail in others. This study benchmarks 11 DL models on the same tasks by reducing their learned embeddings' dimensionality and evaluating them through clustering. We found that audio-pretrained DL models 1) without fine-tuning even underperform fine-tuned AlexNet, 2) both with and without fine-tuning fail to separate the background from labeled sounds, but ResNet does, and 3) outperform other models when fewer background sounds are included during fine-tuning. This study underscores the necessity of fine-tuning audio-pretrained models and checking the embeddings after fine-tuning. Our code is available at https://github.com/NeuroscienceAI/Audio_Embeddings

artificial intelligence, deep learning, machine learning, (15 more...)

arXiv.org Artificial Intelligence

2508.1023

Country: North America > United States (0.14)

Genre: Research Report (1.00)

Industry: Health & Medicine (0.69)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
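
The evaluation idea in the abstract above (reduce the embeddings' dimensionality, cluster them, and compare clusters with the labels) can be sketched as follows. The abstract does not say which reduction or clustering methods the authors used, so this sketch assumes stand-ins: a one-component PCA computed by power iteration, a two-cluster k-means, and cluster purity as the score. The toy "call" and "background" embeddings are synthetic and all function names are hypothetical.

```python
import random
from collections import Counter


def center(rows):
    """Subtract the per-dimension mean from every embedding."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def top_component(rows, iters=100, seed=0):
    """Leading principal direction of centered rows (power iteration on X^T X)."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def reduce_to_1d(rows):
    """Project embeddings onto their first principal component."""
    centered = center(rows)
    v = top_component(centered)
    return [sum(r[j] * v[j] for j in range(len(v))) for r in centered]


def kmeans_1d(values, iters=50):
    """Two-cluster k-means on scalars, initialised at the min and max."""
    c = [min(values), max(values)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [0 if abs(x - c[0]) <= abs(x - c[1]) else 1 for x in values]
        for k in (0, 1):
            members = [x for x, a in zip(values, assign) if a == k]
            if members:
                c[k] = sum(members) / len(members)
    return assign


def purity(assign, labels):
    """Fraction of points whose label matches their cluster's majority label."""
    total = 0
    for k in set(assign):
        counts = Counter(l for a, l in zip(assign, labels) if a == k)
        total += counts.most_common(1)[0][1]
    return total / len(labels)


# Synthetic 4-D embeddings: two well-separated groups (an assumption for the demo).
rng = random.Random(1)
calls = [[rng.gauss(0, 0.3) for _ in range(4)] for _ in range(20)]
background = [[5 + rng.gauss(0, 0.3), 5 + rng.gauss(0, 0.3),
               rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(20)]
embeddings = calls + background
labels = ["call"] * 20 + ["background"] * 20

score = purity(kmeans_1d(reduce_to_1d(embeddings)), labels)
print(f"cluster purity: {score:.2f}")
```

A low purity on real embeddings would indicate that labeled sounds and background are not separated in the embedding space, which is the kind of failure the abstract reports for some models.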

VIBES -- Vision Backbone Efficient Selection

Guerin, Joris, Bansal, Shray, Shaban, Amirreza, Mann, Paulo, Gazula, Harshvardhan

arXiv.org Artificial Intelligence, Oct-11-2024

This work tackles the challenge of efficiently selecting high-performance pre-trained vision backbones for specific target tasks. Although exhaustive search within a finite set of backbones can solve this problem, it becomes impractical for large datasets and backbone pools. To address this, we introduce Vision Backbone Efficient Selection (VIBES), which aims to quickly find well-suited backbones, potentially trading off optimality for efficiency. We propose several simple yet effective heuristics to address VIBES and evaluate them across four diverse computer vision datasets. Our results show that these approaches can identify backbones that outperform those selected from generic benchmarks, even within a limited search budget of one hour on a single GPU. We reckon VIBES marks a paradigm shift from benchmarks to task-specific optimization.

artificial intelligence, backbone, machine learning, (14 more...)

arXiv.org Artificial Intelligence

2410.08592

Country:

South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
North America > United States > Massachusetts (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
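
The trade-off described in the VIBES abstract, giving up guaranteed optimality to stay within a search budget, can be illustrated with a small budgeted search. The abstract does not specify the heuristics the authors propose, so this is only one plausible sketch: candidates are visited in a given prior order (for example, a generic benchmark ranking) and evaluated only if their cost still fits the remaining budget. The backbone names, costs, and scores are made up for the example.

```python
def select_backbone(candidates, evaluate, budget):
    """Greedy budgeted search.

    candidates: list of (name, cost) pairs in prior order (hypothetical).
    evaluate:   callable returning a task-specific score for a backbone name.
    budget:     total cost allowed, in the same unit as the candidate costs.
    Returns (best_name, best_score, spent).
    """
    spent = 0.0
    best_name, best_score = None, float("-inf")
    for name, cost in candidates:
        if spent + cost > budget:
            continue  # too expensive for what is left; a cheaper one may still fit
        spent += cost
        score = evaluate(name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, spent


# Made-up evaluation costs (minutes) and task scores for four backbones.
candidates = [("backbone_a", 20), ("backbone_b", 30),
              ("backbone_c", 25), ("backbone_d", 10)]
task_scores = {"backbone_a": 0.71, "backbone_b": 0.78,
               "backbone_c": 0.90, "backbone_d": 0.74}

best = select_backbone(candidates, task_scores.__getitem__, budget=60)
print(best)
```

With a 60-minute budget the search never evaluates `backbone_c`, the best candidate in this toy table, and settles for `backbone_b`. An exhaustive search would find the optimum but costs 85 minutes here.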

A method to benchmark high-dimensional process drift detection

Wolf, Edgar, Windisch, Tobias

arXiv.org Artificial Intelligence, Sep-5-2024

Process curves are multi-variate finite time series data coming from manufacturing processes. This paper studies machine learning methods for detecting drifts of process curves. A theoretical framework to synthetically generate process curves in a controlled way is introduced in order to benchmark machine learning algorithms for process drift detection. An evaluation score, called the temporal area under the curve, is introduced, which makes it possible to quantify how well machine learning models unveil curves belonging to drift segments. Finally, a benchmark study comparing popular machine learning approaches on synthetic data generated with the introduced framework is shown.

detector, drift segment, process curve, (14 more...)

arXiv.org Artificial Intelligence

2409.03669

Country:

North America > United States > New York > New York County > New York City (0.04)
North America > Canada > Alberta > Census Division No. 15 > Improvement District No. 9 > Banff (0.04)
Europe > Germany (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)

OpenHEXAI: An Open-Source Framework for Human-Centered Evaluation of Explainable Machine Learning

Ma, Jiaqi, Lai, Vivian, Zhang, Yiming, Chen, Chacha, Hamilton, Paul, Ljubenkov, Davor, Lakkaraju, Himabindu, Tan, Chenhao

arXiv.org Artificial Intelligence, Feb-20-2024

Recently, there has been a surge of explainable AI (XAI) methods driven by the need for understanding machine learning model behaviors in high-stakes scenarios. However, properly evaluating the effectiveness of the XAI methods inevitably requires the involvement of human subjects, and conducting human-centered benchmarks is challenging in a number of ways: designing and implementing user studies is complex; numerous design choices in the design space of user study lead to problems of reproducibility; and running user studies can be challenging and even daunting for machine learning researchers. To address these challenges, this paper presents OpenHEXAI, an open-source framework for human-centered evaluation of XAI methods. OpenHEXAI features (1) a collection of diverse benchmark datasets, pre-trained models, and post hoc explanation methods; (2) an easy-to-use web application for user study; (3) comprehensive evaluation metrics for the effectiveness of post hoc explanation methods in the context of human-AI decision making tasks; (4) best practice recommendations for experiment documentation; and (5) convenient tools for power analysis and cost estimation. OpenHEXAI is the first large-scale infrastructural effort to facilitate human-centered benchmarks of XAI methods. It simplifies the design and implementation of user studies for XAI methods, thus allowing researchers and practitioners to focus on the scientific questions. Additionally, it enhances reproducibility through standardized designs. Based on OpenHEXAI, we further conduct a systematic benchmark of four state-of-the-art post hoc explanation methods and compare their impacts on human-AI decision making tasks in terms of accuracy, fairness, as well as users' trust and understanding of the machine learning model.

dataset, explanation method, user study, (13 more...)

arXiv.org Artificial Intelligence

2403.05565

Country:

North America > United States > Illinois > Cook County > Chicago (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > North Carolina (0.04)
(2 more...)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > Experimental Study (0.69)
Research Report > New Finding (0.68)

Industry: Health & Medicine (0.47)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)

Inadequacies of Large Language Model Benchmarks in the Era of Generative Artificial Intelligence

McIntosh, Timothy R., Susnjak, Teo, Liu, Tong, Watters, Paul, Halgamuge, Malka N.

arXiv.org Artificial Intelligence, Feb-15-2024

The rapid rise in popularity of Large Language Models (LLMs) with emerging capabilities has spurred public curiosity to evaluate and compare different LLMs, leading many researchers to propose their LLM benchmarks. Noticing preliminary inadequacies in those benchmarks, we embarked on a study to critically assess 23 state-of-the-art LLM benchmarks, using our novel unified evaluation framework through the lenses of people, process, and technology, under the pillars of functionality and security. Our research uncovered significant limitations, including biases, difficulties in measuring genuine reasoning, adaptability, implementation inconsistencies, prompt engineering complexity, evaluator diversity, and the overlooking of cultural and ideological norms in one comprehensive assessment. Our discussions emphasized the urgent need for standardized methodologies, regulatory certainties, and ethical guidelines in light of Artificial Intelligence (AI) advancements, including advocating for an evolution from static benchmarks to dynamic behavioral profiling to accurately capture LLMs' complex behaviors and potential risks. Our study highlighted the necessity for a paradigm shift in LLM evaluation methodologies, underlining the importance of collaborative efforts for the development of universally accepted benchmarks and the enhancement of AI systems' integration into society.

benchmark, evaluation, llm, (17 more...)

arXiv.org Artificial Intelligence

2402.0988

Country:

Oceania > Australia > Victoria > Melbourne (0.14)
Oceania > New Zealand > North Island > Auckland Region > Auckland (0.04)
Oceania > Australia > New South Wales > Sydney (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.68)
Research Report > Experimental Study (0.66)

Industry:

Law (1.00)
Information Technology > Security & Privacy (1.00)
Education (1.00)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.82)

Structure-based out-of-distribution (OOD) materials property prediction: a benchmark study

Omee, Sadman Sadeed, Fu, Nihang, Dong, Rongzhi, Hu, Ming, Hu, Jianjun

arXiv.org Artificial Intelligence, Jan-15-2024

In real-world material research, machine learning (ML) models are usually expected to predict and discover novel exceptional materials that deviate from the known materials. It is thus a pressing question to provide an objective evaluation of ML model performances in property prediction of out-of-distribution (OOD) materials that are different from the training set distribution. Traditional performance evaluation of materials property prediction models through random splitting of the dataset frequently results in artificially high performance assessments due to the inherent redundancy of typical material datasets. Here we present a comprehensive benchmark study of structure-based graph neural networks (GNNs) for extrapolative OOD materials property prediction. We formulate five different categories of OOD ML problems for three benchmark datasets from the MatBench study. Our extensive experiments show that current state-of-the-art GNN algorithms significantly underperform for the OOD property prediction tasks on average compared to their baselines in the MatBench study, demonstrating a crucial generalization gap in realistic material prediction tasks. We further examine the latent physical spaces of these GNN models and identify the sources of CGCNN, ALIGNN, and DeeperGATGNN's significantly more robust OOD performance than those of the current best models in the MatBench study (coGN and coNGN), and provide insights to improve their performance.

algorithm, dataset, prediction, (16 more...)

arXiv.org Artificial Intelligence

2401.08032

Country: North America > United States > South Carolina > Richland County > Columbia (0.14)

Genre: Research Report (1.00)

Industry: Education (0.66)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
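
The contrast the abstract draws between random splitting and out-of-distribution evaluation can be made concrete with a property-value extrapolation split. The abstract mentions five OOD problem categories without defining them, so the split below is a generic illustration and not necessarily one of the authors' categories. It holds out the samples with the highest target values, so the test set lies outside the range seen in training. The records and the function name are hypothetical.

```python
def extrapolation_split(records, key, test_fraction=0.2):
    """Hold out the records with the largest key values as an OOD test set."""
    ordered = sorted(records, key=key)
    n_test = max(1, int(round(len(ordered) * test_fraction)))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical (material_id, property_value) pairs.
materials = [("m0", 0.4), ("m1", 2.1), ("m2", 1.3), ("m3", 3.8), ("m4", 0.9),
             ("m5", 2.7), ("m6", 1.8), ("m7", 4.5), ("m8", 0.2), ("m9", 3.1)]

train, test = extrapolation_split(materials, key=lambda rec: rec[1])
print("max train value:", max(v for _, v in train))
print("test set:", test)
```

Every test value exceeds every training value, so a model scored on this split has to extrapolate. A random split would typically place near-duplicates of the test points in the training set.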

LOB-Based Deep Learning Models for Stock Price Trend Prediction: A Benchmark Study

Prata, Matteo, Masi, Giuseppe, Berti, Leonardo, Arrigoni, Viviana, Coletta, Andrea, Cannistraci, Irene, Vyetrenko, Svitlana, Velardi, Paola, Bartolini, Novella

arXiv.org Artificial Intelligence, Sep-19-2023

The recent advancements in Deep Learning (DL) research have notably influenced the finance sector. We examine the robustness and generalizability of fifteen state-of-the-art DL models focusing on Stock Price Trend Prediction (SPTP) based on Limit Order Book (LOB) data. To carry out this study, we developed LOBCAST, an open-source framework that incorporates data preprocessing, DL model training, evaluation and profit analysis. Our extensive experiments reveal that all models exhibit a significant performance drop when exposed to new data, thereby raising questions about their real-world market applicability. Our work serves as a benchmark, illuminating the potential and the limitations of current approaches and providing insight for innovative solutions.

benchmark study, lob-based deep learning model, stock price trend prediction

arXiv.org Artificial Intelligence

2308.01915

Genre: Research Report (0.69)

Industry: Banking & Finance > Trading (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Interpretable Regional Descriptors: Hyperbox-Based Local Explanations

Dandl, Susanne, Casalicchio, Giuseppe, Bischl, Bernd, Bothmann, Ludwig

arXiv.org Machine Learning, May-4-2023

This work introduces interpretable regional descriptors, or IRDs, for local, model-agnostic interpretations. IRDs are hyperboxes that describe how an observation's feature values can be changed without affecting its prediction. They justify a prediction by providing a set of "even if" arguments (semi-factual explanations), and they indicate which features affect a prediction and whether pointwise biases or implausibilities exist. A concrete use case shows that this is valuable for both machine learning modelers and persons subject to a decision. We formalize the search for IRDs as an optimization problem and introduce a unifying framework for computing IRDs that covers desiderata, initialization techniques, and a post-processing method. We show how existing hyperbox methods can be adapted to fit into this unified framework. A benchmark study compares the methods based on several quality measures and identifies two strategies to improve IRDs.

artificial intelligence, machine learning, optimization problem, (13 more...)

arXiv.org Machine Learning

doi: 10.1007/978-3-031-43418-1_29

2305.0278

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
North America > United States > Arizona > Maricopa County > Phoenix (0.04)
Europe > United Kingdom > Scotland > City of Glasgow > Glasgow (0.04)
Europe > Switzerland (0.04)

Genre: Research Report > Experimental Study (0.46)

Industry: Banking & Finance (0.68)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)
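
The notion of a hyperbox within which an observation's feature values can change without affecting its prediction can be checked numerically. The sketch below is not the authors' optimization procedure. It only measures, on a regular grid, what fraction of points inside a candidate box receive the same prediction as the observation. The toy credit model, feature names, and function names are assumptions made for the example.

```python
from itertools import product


def box_contains(box, x):
    """True if observation x lies inside the hyperbox (per-feature intervals)."""
    return all(lo <= x[name] <= hi for name, (lo, hi) in box.items())


def box_precision(predict, box, target, steps=5):
    """Fraction of grid points in the box whose prediction equals target."""
    names = list(box)
    axes = [[lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
            for lo, hi in (box[n] for n in names)]
    points = [dict(zip(names, values)) for values in product(*axes)]
    hits = sum(predict(p) == target for p in points)
    return hits / len(points)


def toy_model(x):
    """Hypothetical credit rule: approve on sufficient income and low debt."""
    return "approve" if x["income"] >= 3000 and x["debt"] <= 1000 else "reject"


applicant = {"income": 4000, "debt": 500}
tight_box = {"income": (3000, 6000), "debt": (0, 1000)}
wide_box = {"income": (2000, 6000), "debt": (0, 1000)}

prediction = toy_model(applicant)
print(box_precision(toy_model, tight_box, prediction))
print(box_precision(toy_model, wide_box, prediction))
```

The tight box supports "even if" statements such as "even if the income were anywhere between 3000 and 6000, the decision would still be approve". The wide box does not, because part of it crosses the decision boundary.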
